Complexity Through Nonextensivity 
The problem of defining and studying complexity of a time series has interested people for years. In 
the context of dynamical systems, Grassberger has suggested that a slow approach of the entropy to 
its extensive asymptotic limit is a sign of complexity. We investigate this idea further by information 
theoretic and statistical mechanics techniques and show that these arguments can be made precise, 
and that they generalize many previous approaches to complexity, in particular unifying ideas from 
the physics literature with ideas from learning and coding theory; there are even connections of this 
statistical approach to algorithmic or Kolmogorov complexity. Moreover, a set of simple axioms 
similar to those used by Shannon in his development of information theory allows us to prove that 
the divergent part of the subextensive component of the entropy is a unique complexity measure. We 
classify time series by their complexities and demonstrate that beyond the 'logarithmic' complexity 
classes widely anticipated in the literature there are qualitatively more complex, 'power-law' classes 
which deserve more attention. 




The problem of quantifying complexity is very old. In- 
terest in the field has been fueled by three sorts of ques- 
tions. First, one would like to make precise an impression 
that some systems, such as life on earth or a turbulent 
fluid flow, evolve toward a state of higher complexity, 
and one would like to classify these states; this is the 
realm of dynamical systems theory. Second, in choosing 
among different models that describe an experiment, one 
wants to quantify a preference for simpler explanations 
or, equivalently, provide a penalty for complex models 
that can be weighed against the more conventional
goodness of fit criteria; this type of question usually is
investigated in statistics. Finally, there are questions about
how hard it is to compute or to describe the state of a 
complex system; this is the area of formal mathematics 
and computer science. 

Research in each of these three directions has given 
birth to numerous definitions of complexity. The usual 
objective is to make these definitions focused enough to 
be operational in particular contexts but general enough 
to connect with our intuitive notions. For many years 
the dominant candidate for a universal measure has been 
the mathematically rigorous notion of Kolmogorov or al- 
gorithmic complexity that measures (roughly) the min- 
imum length of a computer program that can recreate 
the observed time series. Unfortunately there is no
algorithm that can calculate the Kolmogorov complexity 
of all data sets. Therefore, for applications to statistics, 
Rissanen and others have developed a new concept:
stochastic complexity of the data with respect to a par- 
ticular class of models, which measures the shortest total 
description of the data and the model within the class, 
but cannot rule out the possibility that a different model 
class could generate a shorter code. 

The main difficulty of all these approaches is that the
Kolmogorov complexity is closely related to the Shannon
entropy, which means that it measures something closer 
to our intuitive concept of randomness than to the
intuitive concept of complexity. A truly random string
cannot be compressed and hence requires a long description,
yet the physical process that generates this string may be 
very simple. As physicists, our intuitive notions of com- 
plexity correspond to statements about the underlying 
process, and not directly to the description length or Kol- 
mogorov complexity: a dynamics with a predictable con- 
stant output (small algorithmic complexity) is as trivial 
as one for which the output is completely unpredictable 
and random (large algorithmic complexity), while really 
complex processes lie somewhere in between. 

The two extreme cases, however, have one feature in 
common: the entropy of the output strings (or, equiva- 
lently, the Kolmogorov complexity of a typical one) ei- 
ther is a fixed constant or grows exactly linearly with 
the length of the strings. In both cases, corrections to 
the asymptotic behavior do not grow with the size of the 
data set. This allowed Grassberger to identify the slow
approach of the entropy to its extensive limit as a sign of 
complexity. He has proposed several functions to analyze 
this slow approach and studied systems that exhibited a 
broad range of complexity properties. 

To deal with the same problem, Rissanen has empha- 
sized strongly that fitting a model to data represents an 
encoding of those data, or predicting future data. Shorter 
encodings generally mean better prediction or generaliza- 
tion. However, much of the code usually describes the 
meaningless, nongeneralizable "noise" — statistical fluc- 
tuations within the model. Only model description is 
relevant to prediction, and this part of the code has been 
termed the model complexity. While systems with
model complexity of very different types are known, the 
two extreme examples above are similar: it only takes a
fixed number of bits to code either a call to a random 
number generator or to a constant function. 

The present work may be viewed as expanding on the 
notions of subextensivity and effective prediction. We 
construct a coherent theory that brings these ideas to- 
gether in an intuitive way, but nonetheless is sufficiently 
general to be applied in many different contexts. We will 
show that with only a little bit of work Grassberger's def- 
initions may be made as mathematically precise as they 
are aesthetically pleasing. Finally, we will argue that the 
definitions are unique if one accepts a set of simple ax- 
ioms in the spirit of Shannon's original work, and that 
these definitions relate to the usual Kolmogorov complex- 
ity in a straightforward way. Much of this paper follows 
closely a more detailed analysis in Ref. to which we 
refer for calculation details and a thorough discussion of 
the relevant literature. 

Our path to connecting the various complexity mea- 
sures begins by noticing that the subextensive compo- 
nents of entropy identified by Grassberger in fact deter- 
mine the information available for making predictions. 
This also suggests a connection to the importance or 
value of information, especially in a biological or eco- 
nomic context: information is valuable if it can be used 
to guide our actions, but actions take time and hence ob- 
served data can be useful only to the extent that those 
data inform us about the state of the world at later times. 
It would be attractive if what we identify as "complex" 
in a time series were also the "useful" or "meaningful" 
components. 

While prediction may come in various forms, depend- 
ing on context, information theory allows us to treat all 
of them on the same footing. For this we only need to 
recognize that all predictions are probabilistic, and that, 
even before we look at the data, we know that certain fu- 
tures are more likely than others. This knowledge can 
be summarized by a prior probability distribution for 
the futures. Our observations on the past lead us to 
a new, more tightly concentrated distribution, the distri- 
bution of futures conditional on the past data. Different 
kinds of predictions are different slices through or aver- 
ages over this conditional distribution, but information 
theory quantifies the "concentration" of the distribution 
without making any commitment as to which averages 
will be most interesting. 

Imagine that we observe a stream of data $x(t)$ over
a time interval $-T < t < 0$; let all of these past data
be denoted by the shorthand $x_{\rm past}$. We are interested
in saying something about the future, so we want to
know about the data $x(t)$ that will be observed in the
time interval $0 < t < T'$; let these future data be called
$x_{\rm future}$. In the absence of any other knowledge, futures
are drawn from the probability distribution $P(x_{\rm future})$,
while observations of particular past data $x_{\rm past}$ tell us
that futures will be drawn from the conditional
distribution $P(x_{\rm future} | x_{\rm past})$. The greater concentration of the
conditional distribution can be quantified by the fact that
it has smaller entropy than the prior distribution, and
this reduction in entropy is Shannon's definition of the
information that the past provides about the future. We
can write the average of this predictive information as



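$$
I_{\rm pred}(T, T') = \left\langle \log_2 \left[ \frac{P(x_{\rm future} | x_{\rm past})}{P(x_{\rm future})} \right] \right\rangle \tag{1}
$$

$$
= -\langle \log_2 P(x_{\rm future}) \rangle - \langle \log_2 P(x_{\rm past}) \rangle - \left[ -\langle \log_2 P(x_{\rm future}, x_{\rm past}) \rangle \right] , \tag{2}
$$

where $\langle \cdots \rangle$ denotes an average over the joint distribution
of the past and the future, $P(x_{\rm future}, x_{\rm past})$.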

Each of the terms in Eq. (2) is an entropy. Since we
are interested in predictability or generalization, which
are associated with some features of the signal
persisting forever, we may assume stationarity or invariance
under time translations. Then the entropy of the past
data depends only on the duration of our observations,
so we can write $-\langle \log_2 P(x_{\rm past}) \rangle = S(T)$, and by the
same argument $-\langle \log_2 P(x_{\rm future}) \rangle = S(T')$. Finally, the
entropy of the past and the future taken together is the
entropy of observations on a window of duration $T + T'$,
so that $-\langle \log_2 P(x_{\rm future}, x_{\rm past}) \rangle = S(T + T')$. Putting
these equations together, we obtain



$$
I_{\rm pred}(T, T') = S(T) + S(T') - S(T + T') . \tag{3}
$$



In the same way that the entropy of a gas at fixed
density is proportional to the volume, the entropy of a time
series (asymptotically) is proportional to its duration, so
that $\lim_{T\to\infty} S(T)/T = S_0$; entropy is an extensive
quantity. But from Eq. (3) any extensive component of the
entropy cancels in the computation of the predictive
information: predictability is a deviation from extensivity.
If we write

$$
S(T) = S_0 T + S_1(T) , \tag{4}
$$

then Eq. (3) tells us that the predictive information is
related only to the nonextensive term $S_1(T)$.

We know two general facts about the behavior of
$S_1(T)$. First, the corrections to extensive behavior
are positive, $S_1(T) \geq 0$. Second, the statement that
entropy is extensive is the statement that the limit
$\lim_{T\to\infty} S(T)/T = S_0$ exists, and for this to be true we
must also have $\lim_{T\to\infty} S_1(T)/T = 0$. Thus the
nonextensive terms in the entropy must be subextensive, that
is, they must grow with $T$ less rapidly than a linear
function. Taken together, these facts guarantee that the
predictive information is positive and subextensive. Further,
if we let the future extend forward for a very long time,
$T' \to \infty$, then we can measure the information that our
sample provides about the entire future,

$$
I_{\rm pred}(T) = \lim_{T'\to\infty} I_{\rm pred}(T, T') = S_1(T) , \tag{5}
$$

and this is precisely equal to the subextensive entropy.
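
As a quick check of Eq. (5), one can substitute the decomposition of Eq. (4) into Eq. (3) and then look at two illustrative forms of the subextensive term. The specific forms of $S_1$ used below (a logarithm with coefficient $K/2$, and a power law with amplitude $A$ and exponent $0 < \nu < 1$) are assumptions chosen for illustration; they are not derived at this point in the text.

$$
\begin{aligned}
I_{\rm pred}(T, T') &= [S_0 T + S_1(T)] + [S_0 T' + S_1(T')] - [S_0 (T + T') + S_1(T + T')] \\
&= S_1(T) + S_1(T') - S_1(T + T') , \\
S_1(T) = \tfrac{K}{2} \log_2 T : \quad & S_1(T') - S_1(T + T') = -\tfrac{K}{2} \log_2 (1 + T/T') \to 0 , \\
S_1(T) = A T^{\nu} : \quad & S_1(T') - S_1(T + T') \approx -\nu A \, T \, T'^{\,\nu - 1} \to 0 \quad (T' \to \infty) .
\end{aligned}
$$

In both illustrative cases the terms that depend on $T'$ cancel in the limit, leaving $I_{\rm pred}(T) = S_1(T)$.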

If we have been observing a time series for a (long)
time $T$, then the total amount of data we have collected
is measured by the entropy $S(T)$, and at large $T$ this
is given approximately by $S_0 T$. But the predictive
information that we have gathered cannot grow linearly with
time, even if we are making predictions about a future
which stretches out to infinity. As a result, of the total
information we have taken in by observing $x_{\rm past}$, only a
vanishing fraction is of relevance to the prediction:

$$
\lim_{T\to\infty} \frac{\text{Predictive Information}}{\text{Total Information}} = \lim_{T\to\infty} \frac{I_{\rm pred}(T)}{S(T)} = 0 . \tag{6}
$$



In this precise sense, most of what we observe is irrele- 
vant to the problem of predicting the future. Since the 
average Kolmogorov complexity of a time series is related 
to its (total) Shannon entropy, this result means also that 
most of the algorithm that is required to encode the data 
encodes aspects of the data that are useless for prediction 
or for guiding our actions based on the data. This is a 
strong indication that the usual notions of Kolmogorov 
complexity in fact do not capture anything at all like the 
(intuitive) utility of the data stream. 

Consider the case where time is measured in discrete
steps, so that we have seen $N$ time points $x_1, x_2, \cdots, x_N$.
How much is there to learn about the underlying pattern
in these data? In the limit of a large number of
observations, $N \to \infty$ or $T \to \infty$, the answer to this question is
surprisingly universal: predictive information may either
stay finite, or grow to infinity together with $T$; in the
latter case the rate of growth may be slow (logarithmic)
or fast (sublinear power).

The first possibility, $\lim_{T\to\infty} I_{\rm pred}(T) = {\rm constant}$,
means that no matter how long we observe we gain only
a finite amount of information about the future. This
situation prevails in both extreme cases mentioned above.
For example, when the dynamics are too regular, as
they are for a purely periodic system, complete prediction is
possible once we know the phase, and if we sample the
data at discrete times this is a finite amount of
information; longer period orbits intuitively are more complex
and also have larger $I_{\rm pred}$, but this doesn't change the
limiting behavior $\lim_{T\to\infty} I_{\rm pred}(T) = {\rm constant}$.
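
The periodic case can be checked directly with a small numerical sketch. It assumes a discrete-time binary signal that repeats a fixed pattern with a uniformly random phase, and it evaluates the predictive information as $S(N) + S(N') - S(N + N')$ from exact block entropies. The pattern and the helper names `block_entropy` and `predictive_information` are illustrative choices, not taken from the paper.

```python
import math
from collections import Counter


def block_entropy(pattern, n):
    """Entropy in bits of length-n windows of a periodic signal
    whose phase is drawn uniformly at random."""
    period = len(pattern)
    windows = Counter(
        tuple(pattern[(phase + i) % period] for i in range(n))
        for phase in range(period)
    )
    return -sum((c / period) * math.log2(c / period) for c in windows.values())


def predictive_information(pattern, n_past, n_future):
    """S(N) + S(N') - S(N + N') for windows of n_past and n_future samples."""
    return (
        block_entropy(pattern, n_past)
        + block_entropy(pattern, n_future)
        - block_entropy(pattern, n_past + n_future)
    )


# Illustrative period-7 pattern in which every phase is distinguishable.
pattern = [0, 0, 1, 0, 1, 1, 1]

for n in (1, 2, 3, 5, 10, 50):
    print(n, round(predictive_information(pattern, n, n), 4))
```

Once the window is long enough to identify the phase, the value stops changing and equals the logarithm of the period, which is the finite amount of information described above.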

Similarly, the predictive information can be small when
the dynamics are irregular but the best predictions are
controlled only by the immediate past, so that the
correlation times of the observable data are finite. This
happens, for example, in many physical systems far away
from phase transitions. Imagine, for example, that we
observe $x(t)$ at a series of discrete times $\{t_n\}$, and that
at each time point we find the value $x_n$. Then we always
can write the joint distribution of the $N$ data points as
a product,

$$
P(x_1, x_2, \cdots, x_N) = P(x_1) P(x_2 | x_1) P(x_3 | x_2, x_1) \cdots . \tag{7}
$$

For Markov processes, what we observe at $t_n$ depends
only on events at the previous time step $t_{n-1}$, so that

$$
P(x_n | \{x_{1 \leq i \leq n-1}\}) = P(x_n | x_{n-1}) , \tag{8}
$$

and hence the predictive information reduces to

$$
I_{\rm pred} = \left\langle \log_2 \left[ \frac{P(x_n | x_{n-1})}{P(x_n)} \right] \right\rangle . \tag{9}
$$

The maximum possible predictive information in this
case is the entropy of the distribution of states at one time
step, which in turn is bounded by the logarithm of the
number of accessible states. To approach this bound the
system must maintain memory for a long time, since the
predictive information is reduced by the entropy of the
transition probabilities. Thus systems with more states
and longer memories have larger values of $I_{\rm pred}$.

More interesting are those cases in which $I_{\rm pred}(T)$
diverges at large $T$. In physical systems we know that
there are critical points where correlation times become
infinite, so that optimal predictions will be influenced by
events in the arbitrarily distant past. Under these
conditions the predictive information can grow without bound
as $T$ becomes large; for many systems the divergence is
logarithmic, $I_{\rm pred}(T \to \infty) \propto \log T$.

Long-range correlations also are important in a time
series where we can learn some underlying rules. Suppose
a series of random vector variables $\{x_i\}$ are drawn
independently from the same probability distribution $Q(x|\alpha)$,
and this distribution depends on a (potentially infinite
dimensional) vector of parameters $\alpha$. The parameters
are unknown, and before the series starts they are
chosen randomly from a distribution $\mathcal{P}(\alpha)$. In this
setting, at least implicitly, our observations of $\{x_i\}$
provide data from which we can learn the parameters $\alpha$.
Here we put aside (for the moment) the usual problem
of learning, which might involve constructing some
estimation or regression scheme that determines a "best
fit" $\alpha$ from the data $\{x_i\}$, and treat the ensemble of
data streams $P[\{x_i\}]$ as we would any other set of
configurations in statistical mechanics or dynamical systems
theory. In particular, we can compute the entropy of
the distribution $P[\{x_i\}]$ even if we can't provide explicit
algorithms for solving the learning problem.

As is shown in Ref. , the crucial quantity in such an
analysis is the density of models in the vicinity of the
target $\bar\alpha$, the parameters that actually generated the
sequence. For two distributions, a natural distance
measure is the Kullback-Leibler divergence
$D_{\rm KL}(\bar\alpha \| \alpha) = \int dx \, Q(x|\bar\alpha) \log [Q(x|\bar\alpha)/Q(x|\alpha)]$, and the density is

$$
\rho(D; \bar\alpha) = \int d^K\alpha \, \mathcal{P}(\alpha) \, \delta[D - D_{\rm KL}(\bar\alpha \| \alpha)] . \tag{10}
$$



If $\rho$ is large as $D \to 0$, then one easily can get close to
the target for many different data; thus they are not very
informative. On the other hand, small density means
that only very particular data lead to $\bar\alpha$, so they carry a
lot of predictive information. Therefore, it is clear that
the density, but not the number of parameters or any
other simplistic measure, characterizes predictability and
the complexity of prediction. If, as often is the case for
$K = \dim \alpha < \infty$, the density behaves in the way common
to finite dimensional systems of the usual statistical
mechanics,

$$
\rho(D \to 0; \bar\alpha) \approx A D^{(K-2)/2} , \tag{11}
$$

then the predictive information to the leading order is

$$
I_{\rm pred}(N) \approx \frac{K}{2} \log N . \tag{12}
$$
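
The logarithmic growth can be illustrated with the simplest one-parameter learning problem. The sketch below assumes $N$ coin flips whose bias is drawn once from a uniform prior, so that $K = 1$; this example and the function names are illustrative and do not come from the paper. Every sequence with $k$ heads then has probability $1/[(N+1)\binom{N}{k}]$, which gives the block entropy $S(N)$ in closed form, and the extensive rate $S_0 = 1/(2 \ln 2)$ bits per flip is the prior average of the binary entropy.

```python
import math


def log2_binom(n, k):
    """log2 of the binomial coefficient C(n, k), via log-gamma."""
    return (
        math.lgamma(n + 1) - math.lgamma(k + 1) - math.lgamma(n - k + 1)
    ) / math.log(2)


def entropy_bits(n):
    """Exact entropy S(N) in bits of N coin flips whose bias is drawn
    once from a uniform prior on [0, 1] (illustrative model, K = 1)."""
    # Each sequence with k heads has probability 1 / ((n + 1) * C(n, k)),
    # and the number of heads k is uniform on 0..n.
    return math.log2(n + 1) + sum(log2_binom(n, k) for k in range(n + 1)) / (n + 1)


# Extensive rate: the binary entropy averaged over the uniform prior.
S0 = 1.0 / (2.0 * math.log(2))


def subextensive_bits(n):
    """S_1(N) = S(N) - S_0 * N."""
    return entropy_bits(n) - S0 * n


for n in (100, 1000, 10000):
    gain = subextensive_bits(2 * n) - subextensive_bits(n)
    fraction = subextensive_bits(n) / entropy_bits(n)
    print(n, round(subextensive_bits(n), 4), round(gain, 4), round(fraction, 6))
```

Each doubling of $N$ adds close to $K/2 = 1/2$ bit to the subextensive entropy, while the ratio of subextensive to total entropy shrinks toward zero, in line with Eqs. (6) and (12).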

The modern theory of learning is concerned in large 
part with quantifying the complexity of a model class, 
and in particular with replacing a simple count of pa- 
rameters with a more rigorous notion of dimensionality 
for the space of models; for a general review of these ideas 
see Ref. and for discussion close in spirit to ours see 
Ref. . The important point here is that the
dimensionality of the model class, and hence the complexity of the
class in the sense of learning theory, emerges as the
coefficient of the logarithmic divergence in $I_{\rm pred}$. Thus a
measure of complexity in learning problems can be derived
from a more general dynamical systems or statistical me- 
chanics point of view, treating the data in the learning 
problem as a time series or one dimensional lattice. The 
logarithmic complexity class that we identify as being 
associated with finite dimensional models also arises, for 
example, at the Feigenbaum accumulation point in the 
period doubling route to chaos.

As noted by Grassberger in his original discussion,
there are time series for which the divergence of $I_{\rm pred}$
is stronger than a logarithm. We can construct an
example by looking at the density function $\rho$ in our learning
problem above: finite dimensional models are associated
with algebraic decay of the density as $D \to 0$, and we can
imagine that there are model classes in which this decay
is more rapid, for example

$$
\rho(D \to 0) \approx A \exp[-B/D^{\mu}] , \quad \mu > 0 . \tag{13}
$$

In this case it can be shown that the predictive
information diverges very rapidly, as a sublinear power law,

$$
I_{\rm pred}(N) \sim N^{\mu/(\mu+1)} . \tag{14}
$$
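
A heuristic way to see where both growth laws could come from is sketched below. It rests on an assumption that is not stated in the text above: that the subextensive entropy behaves like $-\ln Z(N)$, where $Z(N) = \int dD \, \rho(D) \, e^{-ND}$ weights each model by how well it survives $N$ observations. Natural logarithms are used, so results differ from bits only by a constant factor.

$$
\begin{aligned}
\rho(D) = A D^{(K-2)/2} : \quad & Z(N) = A \int_0^\infty dD \, D^{K/2 - 1} e^{-ND} = A \, \Gamma(K/2) \, N^{-K/2} , \\
& -\ln Z(N) = \tfrac{K}{2} \ln N + {\rm const} ; \\
\rho(D) = A \exp[-B/D^{\mu}] : \quad & Z(N) = A \int_0^\infty dD \, e^{-\phi(D)} , \quad \phi(D) = B D^{-\mu} + N D , \\
& \phi'(D^*) = 0 \;\Rightarrow\; D^* = (\mu B / N)^{1/(\mu+1)} , \\
& -\ln Z(N) \approx \phi(D^*) = (\mu + 1) \, B^{1/(\mu+1)} \mu^{-\mu/(\mu+1)} \, N^{\mu/(\mu+1)} .
\end{aligned}
$$

Under this assumption the algebraic density reproduces the $(K/2) \log N$ form of Eq. (12), and the saddle-point estimate for the exponential density reproduces the power law of Eq. (14).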

One way that this scenario can arise is if the distribution
$Q(x)$ that we are trying to learn does not belong to any
finite parameter family, but is itself drawn from a
distribution that enforces a degree of smoothness.
Understandably, stronger smoothness constraints have smaller
powers (less to predict) than the weaker ones (more to
predict). For example, a rather simple case of predicting
a one dimensional variable that comes from a continuous
distribution produces $I_{\rm pred}(N) \sim \sqrt{N}$.

As with the logarithmic class, we expect that power-law
divergences in $I_{\rm pred}$ are not restricted to the
learning problems that we have studied in detail. The
general point is that such behavior will be seen in
problems where predictability over long scales, rather than
being controlled by a fixed set of ever more precisely
known parameters, is governed by a progressively more
detailed description (effectively increasing the number
of parameters) as we collect more data. This seems a
plausible description of what happens in language, where
rules of spelling allow us to predict forthcoming letters
of long words, grammar binds the words together, and
compositional unity of the entire text allows us to make
predictions about the subject of the last page of the book
after reading only the first few. Indeed, Shannon's
classic experiment on the predictability of English text (by
human readers!) shows this behavior, and more
recently several groups have extracted power-law
subextensive components from the numerical analysis of large
corpora of text.

Interestingly, even without an explicit example, a sim- 
ple argument ensures existence of exponential densities 
and, therefore, power law predictive information models. 
If the number of parameters in a learning problem is not 
finite then in principle it is impossible to predict anything 
unless there is some appropriate regularization. If we let 
the number of parameters stay finite but become large, 
then there is more to be learned and correspondingly the 
predictive information grows in proportion to this num- 
ber. On the other hand, if the number of parameters 
becomes infinite without regularization, then the predic- 
tive information should go to zero since nothing can be 
learned. We should be able to see this happen in a regu- 
larized problem as the regularization weakens: eventually 
the regularization would be insufficient and the predictive 
information would vanish. The only way this can hap- 
pen is if the predictive information grows more and more 
rapidly with N as we weaken the regularization, until fi- 
nally it becomes extensive (equivalently, drops to zero) 
at the point where prediction becomes impossible. To 
realize this scenario we have to go beyond $I_{\rm pred} \propto \log T$
with $I_{\rm pred} \sim N^{\mu/(\mu+1)}$; the transition from increasing
predictive information to zero occurs as the exponent
$\mu/(\mu+1) \to 1$, that is, as $\mu \to \infty$.

This discussion makes it clear that the predictive infor- 
mation (the subextensive entropy) distinguishes between 
problems of intuitively different complexity and thus, in 
accord with Grassberger's definitions, is probably a good
choice for a universal complexity measure. Can this in- 
tuition be made more precise? 

First we need to decide whether we want to attach
measures of complexity to a particular signal $x(t)$ or whether
we are interested in measures that are defined by an
average over the ensemble $P[x(t)]$. One problem in
assigning complexity to single realizations is that there can be
atypical data streams. Second, Grassberger in
particular has argued that our visual intuition about the
complexity of spatial patterns is an ensemble concept, 
even if the ensemble is only implicit. The fact that we 
admit probabilistic models is crucial: even at a colloquial 
level, if we allow for probabilistic models then there is a 
simple description for a sequence of truly random bits, 
but if we insist on a deterministic model then it may 
be very complicated to generate precisely the observed 
string of bits. Furthermore, in the context of probabilis- 
tic models it hardly makes sense to ask for a dynamics 
that generates a particular data stream; we must ask for 
dynamics that generate the data with reasonable prob- 
ability, which is more or less equivalent to asking that 
the given string be a typical member of the ensemble 
generated by the model. All of these paths lead us to 
thinking not about single strings but about ensembles 
in the tradition of statistical mechanics, and so we shall 
search for measures of complexity that are averages over 
the distribution $P[x(t)]$.

Once we focus on average quantities, we can provide 
an axiomatic proof (much in the spirit of Shannon's
arguments establishing entropy as a unique information
measure) that links $I_{\rm pred}$ to complexity. We can start by
adopting Shannon's postulates as constraints on a mea- 
sure of complexity: if there are N equally likely signals, 
then the measure should be monotonic in N; if the sig- 
nal is decomposable into statistically independent parts 
then the measure should be additive with respect to this 
decomposition; and if the signal can be described as a 
leaf on a tree of statistically independent decisions then 
the measure should be a weighted sum of the measures at 
each branching point. We believe that these constraints 
are as plausible for complexity measures as for informa- 
tion measures, and it is well known from Shannon's orig- 
inal work that this set of constraints leaves the entropy 
as the only possibility. Since we are discussing a time de- 
pendent signal, this entropy depends on the duration of 
our sample, $S(T)$. We know of course that this cannot be
the end of the discussion, because we need to distinguish 
between randomness (entropy) and complexity. The path 
to this distinction is to introduce other constraints on our 
measure. 
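
The three postulates can be checked numerically for the Shannon entropy itself. The sketch below is only an illustration with arbitrarily chosen distributions; the helper `H` and the example probabilities are assumptions, and the third check uses the standard grouping example of splitting a three-way choice into two successive decisions.

```python
import math


def H(probs):
    """Shannon entropy in bits of a discrete distribution."""
    return -sum(p * math.log2(p) for p in probs if p > 0)


# 1. Monotonic in N for N equally likely signals.
uniform = [H([1.0 / n] * n) for n in range(1, 9)]

# 2. Additive over statistically independent parts.
p = [0.2, 0.8]
q = [0.5, 0.3, 0.2]
joint = [a * b for a in p for b in q]

# 3. Weighted sum over a tree of decisions: choose between two branches
#    with probability 1/2 each, then split the second branch 2/3 : 1/3.
direct = H([1 / 2, 1 / 3, 1 / 6])
staged = H([1 / 2, 1 / 2]) + 0.5 * H([2 / 3, 1 / 3])

print("monotonic:", all(a < b for a, b in zip(uniform, uniform[1:])))
print("additive: ", H(joint), H(p) + H(q))
print("tree:     ", direct, staged)
```

Any candidate complexity measure built on these postulates inherits the same three properties, which is why further constraints are needed to separate complexity from randomness.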

First we notice that if the signal $x$ is continuous, then
the entropy is not invariant under transformations of $x$
that do not mix points at different times
(reparameterizations). It seems reasonable to ask that complexity be
a function of the process we are observing and not of
the coordinate system in which we choose to record our
observations. However, it is not the whole function
$S(T)$ which depends on the coordinate system for $x$; it is
only the extensive component of the entropy that has this
noninvariance. This can be seen more generally by
noting that subextensive terms in the entropy contribute to
the mutual information among different segments of the
data stream (including the predictive information defined
here), while the extensive entropy cannot; mutual
information is coordinate invariant, so all of the noninvariance
must reside in the extensive term. Thus, any measure
of complexity that is coordinate invariant must discard the
extensive component of the entropy.
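
A short worked example makes the first condition concrete. It assumes a stationary signal sampled at $N$ discrete times and a smooth, invertible, time-independent change of variables $y_n = f(x_n)$; these are illustrative assumptions, and $f$ is a hypothetical reparameterization.

$$
\begin{aligned}
P_y(y_1, \cdots, y_N) &= P_x(x_1, \cdots, x_N) \prod_{n=1}^{N} |f'(x_n)|^{-1} , \\
S_y(N) &= S_x(N) + \sum_{n=1}^{N} \langle \log_2 |f'(x_n)| \rangle = S_x(N) + N \langle \log_2 |f'(x)| \rangle , \\
S_y(N) + S_y(N') - S_y(N + N') &= S_x(N) + S_x(N') - S_x(N + N') .
\end{aligned}
$$

The change of variables shifts the entropy by a term proportional to $N$, so only the extensive coefficient $S_0$ is affected, and the combination of Eq. (3) that defines the predictive information is unchanged.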

If we continue along these lines, we can think about 
the asymptotic expansion of the entropy at large T. The 
extensive term is the first term in this series, and we have 
seen that it must be discarded. What about the other 
terms? In the context of predicting in a parameterized 
model, most of the terms in this series depend in detail 
on our prior distribution in parameter space, which might 
seem odd for a measure of complexity. More generally, if 
we consider transformations of the data stream $x(t)$ that
mix points within a temporal window of size $\tau$, then for
$T \gg \tau$ the entropy $S(T)$ may have subextensive terms
which are constant, and these are not invariant under
this class of transformations. On the other hand, if there
are divergent subextensive terms, these are invariant
under such temporally local transformations. So if we
insist that measures of complexity be invariant not only 
under instantaneous coordinate transformations, but also 
under temporally local transformations, then we can dis- 
card both the extensive and the finite subextensive terms 
in the entropy, leaving only the divergent subextensive 
terms as a possible measure of complexity. 

To illustrate the purpose of these two extra conditions, 
we may think of the following example: measuring veloc- 
ity of a turbulent fluid flow at a given point. The condi- 
tion of invariance under reparameterizations means that 
the complexity is independent of the scale used by the 
speedometer. On the other hand, the second condition 
ensures that the temporal mixing due to the finiteness of 
the inertia of the speedometer's needle does not change 
the estimated complexity of the flow. 

In our view, these arguments (or their slight variation
also presented in Ref. ) settle the question of the unique
definition of complexity. Not only is the divergent
subextensive component of the entropy the unique complexity
measure, but it is also a universal one since it is
connected in a straightforward way to many other measures
that have arisen in statistics and in dynamical systems
theory. A bit less straightforward is the connection to
Kolmogorov's definition that started the whole
discussion, but even this can be made.

To make this connection we follow the suggestion of
Standish that one should focus not on the
complexity of particular strings but of equivalence classes. In the
present case it is natural to define an equivalence class
of data $x(-T < t < 0)$ as those data that generate
indistinguishable conditional probability distributions for the
future, $P[x(t > 0) | x(-T < t < 0)]$. If this conditional
distribution has sufficient statistics, then there exists a
compression of the past data $x(-T < t < 0)$ into exactly
$I_{\rm pred}(T)$ bits while preserving all of the mutual
information with the future. But this means that the ensemble of
data in an equivalence class can be described, on average,
using exactly this many bits. Thus, for dynamics such 
that the prediction problem has sufficient statistics, the 
average Kolmogorov complexity of equivalence classes de- 
fined by the indistinguishability of predictions is equal to 
the predictive information. By the arguments above, pre- 
diction is the useful thing which we can do with a data 
stream, and so in this case it makes sense to say that the 
Kolmogorov complexity of representing the useful bits of 
data is equal to the predictive information. Note also 
that Kolmogorov complexity is defined only up to a
constant depending on the computer used. A
computer-independent definition requires ignoring constant terms
and focusing only on asymptotic behavior. This agrees 
very well with our arguments above that identified only 
the divergent part of the predictive information with the 
complexity of a data stream. 

In the terminology suggested by Grassberger, the
statement that the prediction problem has sufficient
statistics means that the True Measure Complexity is
equal to the Effective Measure Complexity;
similarly, the statistical complexity defined by Crutchfield and
coworkers then also is equal to the predictive
information defined here. These are strong statements, and it is
likely that they are not true precisely for most natural
data streams. More generally one can ask for
compressions that preserve the maximum fraction of the relevant
(in this case, predictive) information, and our intuitive
notion of data being "understandable" or
"summarizable" is that these selective compressions can be very
efficient; here efficiency means that we can
compress the past into a description with length not much
larger than $I_{\rm pred}(T)$ while preserving a finite fraction of
the (diverging) information about the future. An
example is when we summarize data by the parameters of the
model that describes the underlying stochastic process.
The opposite situation is illustrated by certain crypto- 
graphic codes, where the relevant information is accessi- 
ble (at best) only from the entire data set. Thus we can 
classify data streams by their predictive information, but 
additionally by whether this predictive information can 
be represented efficiently. For those data where efficient 
representation is possible, the predictive information and 
the mean Kolmogorov complexity of future-equivalent 
classes will be similar; with more care we can guarantee 
that these quantities are proportional as $T \to \infty$.
Perhaps Wigner's famous remarks about the unreasonable
effectiveness of mathematics in the natural sciences could 
be rephrased as the conjecture that the data streams oc- 
curring in nature — although often complex as measured 
by their predictive information — nonetheless belong to 
this efficiently representable class. 